Abstract. The binary string matching problem consists in finding all
the occurrences of a pattern in a text where both strings are built on
a binary alphabet. This is an interesting problem in computer science,
since binary data are omnipresent in telecom and computer network
applications. Moreover, the problem also finds applications in the field of
image processing and in pattern matching on compressed texts. Recently
it has been shown that adaptations of classical exact string matching
algorithms are not very efficient on binary data. In this paper we present
two efficient algorithms for the problem, adapted to completely avoid
any reference to bits, so that pattern and text are processed byte by byte.
Experimental results show that the new algorithms outperform existing
solutions in most cases.

Keywords: string matching, binary strings, experimental algorithms, 
compressed text processing, text processing. 

1 Introduction 

Given a text t and a pattern p over some alphabet Σ of size σ, the string
matching problem consists in finding all occurrences of the pattern p in the text 
t. It is a very extensively studied problem in computer science, mainly due to 
its direct applications to such diverse areas as text, image and signal processing, 
speech analysis and recognition, information retrieval, computational biology 
and chemistry, etc. 

In this article we consider the problem of searching for a pattern p of length
m in a text t of length n, where both strings are built over a binary alphabet
and each character of p and t is represented by a single bit. Thus the memory
space needed to represent t and p is, respectively, ⌈n/8⌉ and ⌈m/8⌉ bytes.

This is an interesting problem in computer science, since binary data are 
omnipresent in telecom and computer network applications. Many formats for 
data exchange between nodes in distributed computer systems as well as most 
network protocols use binary representations. Binary images often arise in dig- 
ital image processing as masks or as the results of certain operations such as 
segmentation, thresholding and dithering. Moreover some input/output devices, 
such as laser printers and fax machines, can only handle binary images. 

The main reason for using binaries is size. A binary is a much more compact
format than the symbolic or textual representation of the same information.
Consequently, fewer resources are required to transmit binaries over the network.
For this reason the binary string matching problem also finds applications in
pattern matching on compressed texts, when using the Huffman compression
strategy [KS05,SD06,FG06].

Observe that the text t, and the pattern p to search for, cannot be directly
processed as strings with a super-alphabet [Fre02,FG06], i.e., where each group
of 8 bits is considered as a character of the text. This is because an occurrence of
the pattern can be found starting at the middle of a group. Suppose, for instance,
that t = 011001001000100110100101000101001001 and p = 0100110100. If we
write text and pattern as groups of 8 bits then we obtain

t = 01100100 10001001 10100101 00010100 1001
p = 01001101 00

The occurrence of the pattern at position 11 of the text cannot be located
by a classical pattern matching algorithm based on a super-alphabet.
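
The example can be checked with a few lines of Python. This is only an illustrative sketch: the helper pack_bits and the variable names are hypothetical, and the unused bits of the last byte are assumed to be padded with zeros.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a '0'/'1' string into bytes, leftmost bit first, zero-padding the tail."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


t = "011001001000100110100101000101001001"
p = "0100110100"

# Searching on individual bits finds the occurrence.
bit_level_shift = t.find(p)

# Searching on 8-bit characters (super-alphabet) does not: the occurrence
# starts in the middle of the second byte, so no byte of t equals a byte of p.
byte_level_shift = pack_bits(t).find(pack_bits(p))
```

Here bit_level_shift is 11, while byte_level_shift is -1 because the packed pattern never appears as a run of whole bytes.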

It is possible to adapt classical efficient algorithms for exact pattern
matching to binary matching with minor modifications. We can substitute in
the algorithms each reference to the character at position i with a reference to the
bit at position i. Roughly speaking we can substitute occurrences of t[i] with
GETBIT(t, i), which returns the i-th bit of the text t. This transformation does
not affect the time complexity of the algorithm, but each bit access costs
several operations and the resulting algorithm is in general not very efficient.
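
A minimal sketch of such a bit accessor is given below, assuming 8-bit blocks with the leftmost bit of each byte stored first. The names get_bit and occurs_at are hypothetical stand-ins for GETBIT and for a bit-by-bit comparison; they are not taken from any of the cited implementations.

```python
def get_bit(T: bytes, i: int) -> int:
    """Return the i-th bit (0-based, leftmost bit of each byte first) of T."""
    return (T[i // 8] >> (7 - i % 8)) & 1


def occurs_at(T: bytes, n: int, P: bytes, m: int, s: int) -> bool:
    """Check bit by bit whether the m-bit pattern P occurs at shift s of the n-bit text T."""
    return s + m <= n and all(get_bit(T, s + i) == get_bit(P, i) for i in range(m))
```

Every bit access costs a division, a shift and a mask, which is the overhead that the block-oriented algorithms discussed in this paper try to avoid.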

In [KBN07] Klein and Ben-Nissan proposed an efficient variant of the
Boyer-Moore algorithm for the binary case without referring to bits. The algorithm
is designed to process only entire blocks such as bytes or words and achieves a
significant reduction in the number of text character inspections. In [KBN07]
the authors also showed by empirical results that the new variant performs better
than the regular binary Boyer-Moore algorithm and even than binary versions
of the most effective algorithms for classical pattern matching.

In this note we present two new efficient algorithms for matching on binary
strings which, despite their O(nm) worst case time complexity, obtain very good
results in practical cases. The first algorithm is an adaptation of the q-Hash
algorithm [Lec07], which is among the most efficient algorithms for the standard
pattern matching problem. We show how the technique adopted by the algorithm
can be naturally translated to allow for blocks of bits.

The second solution can be seen as an adaptation to binary string matching of
the Skip-Search algorithm [CLP98]. This algorithm can be efficiently adapted
to completely avoid any reference to bits, so that pattern and text are
processed byte by byte.

The paper is organized as follows. In Section 2 we introduce basic definitions
and the terminology used throughout the article. In Section 3 we introduce a high
level model used to process binary strings avoiding any reference to bits. Next,
in Section 4, we introduce the new solutions. Experimental data obtained by
running all the reviewed algorithms under various conditions are presented and
compared in Section 5. Finally, we draw our conclusions in Section 6.



2 Preliminaries and basic definitions 

A string p of length m ≥ 0 is represented as a finite array p[0 .. m - 1] of characters
from a finite alphabet Σ. In particular, for m = 0 we obtain the empty string,
also denoted by ε. By p[i] we denote the (i + 1)-th character of p, for 0 ≤ i < m.
Likewise, by p[i .. j] we denote the substring of p contained between the (i + 1)-th
and the (j + 1)-st characters of p, for 0 ≤ i ≤ j < m. Moreover, for any i, j ∈ Z,
we put p[i .. j] = ε if i > j and p[i .. j] = p[max(i, 0) .. min(j, m - 1)] if i ≤ j. A
substring of p is also called a factor of p. A substring of the form p[0 .. i] is called
a prefix of p and a substring of the form p[i .. m - 1] is called a suffix of p for
0 ≤ i ≤ m - 1. For any two strings u and w, we write w ⊐ u to indicate that w
is a suffix of u. Similarly, we write w ⊏ u to indicate that w is a prefix of u.

Let t be a text of length n and let p be a pattern of length m. When the
character p[0] is aligned with the character t[s] of the text, so that the character
p[i] is aligned with the character t[s + i], for i = 0, ..., m - 1, we say that the
pattern p has shift s in t. In this case the substring t[s .. s + m - 1] is called the
current window of the text. If t[s .. s + m - 1] = p, we say that the shift s is valid.

Most string matching algorithms have the following general structure. First, 
during a preprocessing phase, they calculate useful mappings, generally in the 
form of tables, which later are accessed to determine nontrivial shift advance- 
ments. Next, starting with shift s = 0, they look for all valid shifts, by executing 
a matching phase, which determines whether the shift s is valid and computes 
a positive shift increment, Δs. Such increment Δs is used to produce the new
shift s + Δs to be fed to the subsequent matching phase.

For instance, in the case of the Naive string matching algorithm, there is 
no preprocessing phase and the matching phase always returns a unitary shift 
increment, i.e. all possible shifts are actually processed. 

3 A High Level Model for Matching on Binary Strings 

A string p over the binary alphabet Σ = {0, 1} is said to be a binary string
and is represented as a binary vector p[0 .. m - 1], whose elements are bits.
Binary vectors are usually structured in blocks of k bits, typically bytes (k = 8),
halfwords (k = 16) or words (k = 32), which can be processed at the cost of a
single operation. If p is a binary string of length m we use the symbol P[i] to
indicate the (i + 1)-th block of p and use p[i] to indicate the (i + 1)-th bit of
p. If B is a block of k bits we indicate with the symbol B_j the j-th bit of B, with
0 ≤ j < k. Thus, for i = 0, ..., m - 1 we have p[i] = P[⌊i/k⌋]_(i mod k).

In this section we present a high level model to process binary strings which
exploits the block structure of text and pattern to speed up the searching phase,
avoiding bitwise operations. We suppose that the block size k is
fixed, so that all references to both text and pattern will only be to entire blocks
of k bits. We refer to a k-bit block as a byte, though larger values than k = 8
could be supported as well. The idea to eliminate any reference to bits and to
proceed block by block was first suggested in [CKP85] for fast decoding
of binary Huffman encoded texts. A similar approach has also been adopted
in [KBN07,Fre02]. For the sake of uniformity we use in the following, whenever
possible, the same terminology adopted in [KBN07].

Fig. 1. Let P = 110010110010110010110. (A) The matrix Patt. (B) The matrix Mask.
(C) The array Last. In Patt and Mask bits belonging to P are underlined. Blocks
containing a factor of P are presented with light gray background color.

Let T[i] and P[i] denote, respectively, the (i + 1)-th byte of the text and
of the pattern, starting for i = 0 with both text and pattern aligned at the
leftmost bit of the first byte. Since the lengths in bits of both text and pattern
are not necessarily multiples of k, the last byte may be only partially defined.
In particular if the pattern has length m then its last byte is that of position
⌈m/k⌉ and only the leftmost (m mod k) bits of the last byte are defined. We
suppose that the undefined bits of the last byte are set to 0.

In our high level model we define a sequence of several copies of the pattern
memorized in the form of a matrix of bytes, Patt, of size k × (⌈m/k⌉ + 1). Each
row i of the matrix Patt contains a copy of the pattern shifted by i positions to
the right. The i leftmost bits of the first byte remain undefined and are set to 0.
Similarly the rightmost k - ((m + i) mod k) bits of the last byte are set to 0.
Formally the j-th bit of byte Patt[i, h] is defined by

$$\mathit{Patt}[i,h]_j = \begin{cases} p[kh - i + j] & \text{if } 0 \le kh - i + j < m \\ 0 & \text{otherwise} \end{cases}$$

for 0 ≤ i < k and 0 ≤ h < ⌈(m + i)/k⌉.

Observe that each factor of length k of the pattern appears once in the table
Patt. In particular, the factor of length k starting at position j of p is memorized
in Patt[(k - (j mod k)) mod k, ⌈j/k⌉].

The high level model uses bytes in the matrix Patt to compare the pattern
block by block against the text for any possible shift of the pattern. However
when comparing the first or last byte of P against its counterpart in the text,
the bit positions not belonging to the pattern have to be neutralized. For this
purpose we define a matrix of bytes, Mask, of size k × (⌈m/k⌉ + 1), containing
binary masks of length k. In particular a bit in the mask Mask[i, h] is set to 1 if
and only if the corresponding bit of Patt[i, h] belongs to P. More formally

$$\mathit{Mask}[i,h]_j = \begin{cases} 1 & \text{if } 0 \le kh - i + j < m \\ 0 & \text{otherwise} \end{cases}$$

for 0 ≤ i < k and 0 ≤ h < ⌈(m + i)/k⌉.

Fig. 2. (A) The Preprocess procedure for the computation of the tables Patt, Mask
and Last. (B) The Binary-Naive algorithm for the binary string matching problem.

Finally we need to compute an array, Last, of size k where Last[i] is defined
to be the index of the last byte in the row Patt[i]. Formally, for 0 ≤ i < k we
define Last[i] = ⌈(m + i)/k⌉.

The procedure Preprocess used to precompute the tables defined above is
presented in Figure 2(A). It requires O(k × ⌈m/k⌉) = O(m) time and O(m)
extra space. Figure 1 shows the precomputed tables defined above for a pattern
P = 110010110010110010110 of length m = 21 and k = 8.

The model uses the precomputed tables to check whether s is a valid shift
without making use of bitwise operations but processing pattern and text byte
by byte. In particular, for a given shift position s (the pattern is aligned with
the s-th bit of the text), we report a match if

$$\mathit{Patt}[i,h] = T[j+h] \mathbin{\&} \mathit{Mask}[i,h], \quad \text{for } h = 0, 1, \ldots, \mathit{Last}[i] \tag{1}$$

where j = ⌊s/k⌋ is the starting byte position in the text and i = (s mod k).

A simple Binary-Naive algorithm, obtained with this high level model, is 
shown in Figure 2(B). The algorithm starts by aligning the left ends of the 
pattern and text. Then, for each value of the shift s = 0, 1, . . . , n — m, it checks 
whether p occurs in t by simply comparing each byte of the pattern with its 
corresponding byte in the text, proceeding from left to right. At the end of 
the matching phase, the shift is advanced by one position to the right. In the
worst case, the Binary-Naive algorithm requires O(⌈m/k⌉ n) comparisons.
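
The tables and the naive search can be sketched in Python as follows. This is an illustrative sketch rather than the code of Figure 2: the pattern is passed as a '0'/'1' string for readability, the function names are hypothetical, and Last[i] is stored as the 0-based index of the last byte of row i that holds pattern bits, which is an assumption about the indexing convention.

```python
def preprocess(p: str, k: int = 8):
    """Build Patt, Mask and Last for a pattern given as a '0'/'1' string."""
    m = len(p)
    cols = (m + k - 1) // k + 1                 # ceil(m/k) + 1 bytes per row
    patt = [[0] * cols for _ in range(k)]
    mask = [[0] * cols for _ in range(k)]
    last = []
    for i in range(k):                          # row i: pattern shifted right by i bits
        for pos, bit in enumerate(p):
            h, j = divmod(i + pos, k)           # byte h, bit j (0 = leftmost)
            patt[i][h] |= int(bit) << (k - 1 - j)
            mask[i][h] |= 1 << (k - 1 - j)
        # Assumption: 0-based index of the last byte that contains pattern bits.
        last.append((m + i + k - 1) // k - 1)
    return patt, mask, last


def binary_naive(T: bytes, n: int, p: str, k: int = 8):
    """Return all valid shifts of p in the n-bit text T using condition (1)."""
    m = len(p)
    patt, mask, last = preprocess(p, k)
    shifts = []
    for s in range(n - m + 1):
        j, i = divmod(s, k)                     # starting byte j, bit offset i
        if all(patt[i][h] == T[j + h] & mask[i][h] for h in range(last[i] + 1)):
            shifts.append(s)
    return shifts
```

For the text and pattern of the example in Section 1, binary_naive(bytes([0x64, 0x89, 0xA5, 0x14, 0x90]), 36, "0100110100") returns [11]. The comparison loop touches only whole bytes; all bit-level work is confined to the preprocessing of the pattern.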



4 New Efficient Binary String Matching Algorithms 



In this section we present two new efficient algorithms for matching on binary
strings based on the high level model presented above. The first algorithm is an
adaptation of the q-Hash algorithm [Lec07], which is among the most efficient
algorithms for the standard pattern matching problem. We show how the
technique adopted by the algorithm can be naturally translated to allow for blocks
of bits.

The second solution can be seen as an adaptation to binary string matching of
the Skip-Search algorithm [CLP98]. This algorithm can be efficiently adapted
to completely avoid any reference to bits, so that pattern and text are
processed byte by byte.

4.1 The Binary-Hash-Matching Algorithm 

Algorithms in the q-Hash family for exact pattern matching were introduced
in [Lec07], where the author presented an adaptation of the Wu and Manber
multiple string matching algorithm [WM94] to the single string matching
problem.

The idea of the q-Hash algorithm is to consider factors of the pattern of
length q. Each substring w of such a length q is hashed using a function hash into
integer values between 0 and 255. Then the algorithm computes in a preprocessing
phase a function Hs : {0, 1, ..., 255} → {0, 1, ..., m - q}, such that for each
0 ≤ c ≤ 255 the value Hs(c) is defined by

$$Hs(c) = \min\big(\{0 \le k < m - q \mid \mathit{hash}(p[m-k-q \,..\, m-k-1]) = c\} \cup \{m - q\}\big).$$

The searching phase of the algorithm consists of reading, for each shift s of the
pattern in the text, the substring w = t[s + m - q .. s + m - 1] of length q. If
Hs(hash(w)) > 0 then a shift of length Hs(hash(w)) is applied. Otherwise, when
Hs(hash(w)) = 0 the pattern p is naively checked in the text. In this case a shift
of length (m - 1 - i) is applied, where i is the starting position of the rightmost
occurrence in p of a factor p[j .. j + q - 1] such that hash(p[j .. j + q - 1]) =
hash(p[m - q + 1 .. m - 1]).

If the pattern p is a binary string we can directly associate each substring
of length q with its numeric value in the range [0, 2^q - 1] without making use
of the hash function. In order to exploit the block structure of the text we take
into account substrings of length q = k. This means that, if k = 8, each block B
of k bits can be considered as a value 0 ≤ B ≤ 255. Thus we define a function
Hs : {0, 1, ..., 2^k - 1} → {0, 1, ..., m}, such that for each byte 0 ≤ B < 2^k

$$Hs(B) = \min\big(\{0 \le u < m \mid p[m-u-k \,..\, m-u-1] \sqsupset B\} \cup \{m\}\big).$$

Here p[m - u - k .. m - u - 1] is clipped to the bounds of p as in Section 2, so
for u > m - k it is a prefix of p shorter than k bits.
Observe that if B = p[m - k .. m - 1] then Hs[B] is defined to be 0.
Fig. 3. The Binary-Hash-Matching algorithm for the binary string matching
problem.



For example, in the case of the pattern P = 110010110010110010110,
presented in Figure 1, we have Hs[01100101] = 2, Hs[11001011] = 1, and moreover
Hs[10010110] = 0.
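
The table Hs can be filled as in the following sketch, which reproduces the three values of the example. It is not the preprocessing code of Figure 3: the function name build_hs is hypothetical, the pattern is given as a '0'/'1' string, and near the left end of the pattern the factor is assumed to be clipped to a prefix of p, following the substring convention of Section 2.

```python
def build_hs(p: str, k: int = 8) -> list:
    """Hs[B] = smallest u such that p[m-u-k .. m-u-1] (clipped to p) is a suffix of B, else m."""
    m = len(p)
    hs = [m] * (1 << k)
    for u in range(m - 1, -1, -1):              # decreasing u, so the minimum is written last
        piece = p[max(0, m - u - k):m - u]      # at most k bits, shorter near the left end of p
        free = k - len(piece)                   # leading bits of B left unconstrained
        low = int(piece, 2)
        for high in range(1 << free):           # every block B ending with 'piece'
            hs[(high << len(piece)) | low] = u
    return hs


hs = build_hs("110010110010110010110")
```

With this table, hs[0b10010110] is 0, hs[0b11001011] is 1 and hs[0b01100101] is 2, while a block that ends with no prefix of P, such as 00000000, gets the maximal shift m = 21.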

The code of the Binary-Hash-Matching algorithm and its preprocessing 
phase are presented in Figure 3. 

The preprocessing phase of the algorithm consists in computing the function
Hs defined above and requires O(m + k 2^(k+1)) time and O(m + 2^k)
extra space. During the search phase the algorithm reads, for each shift position
s of the pattern in the text, the block B = t[s + m - q .. s + m - 1] of k bits
(lines 11-12). If Hs(B) > 0 then a shift of length Hs(B) is applied (line 20).
Otherwise, when Hs(B) = 0 the pattern p is naively checked in the text block
by block (lines 15-18).

After the test an advancement of length shift is applied (line 19) where

$$\mathit{shift} = \min\big(\{0 < u < m \mid p[m-u-k \,..\, m-u-1] \sqsupset p[m-k \,..\, m-1]\} \cup \{m\}\big).$$

Observe that if the block B has its si rightmost bits in the j-th block of T and
the (k - si) leftmost bits in the block T[j - 1], then it is computed by performing
the following bitwise operation, with the result truncated to k bits

$$B = (T[j] \gg (k - \mathit{si})) \mid (T[j-1] \ll \mathit{si})$$
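
As a sanity check, the extraction of a block that straddles two bytes can be sketched as below. The sketch takes the prose description as its specification (si bits of B in T[j], k - si bits in T[j - 1]); the function name and the explicit truncation mask are assumptions of this example.

```python
def block_from_two_bytes(T: bytes, j: int, si: int, k: int = 8) -> int:
    """Return the k-bit block whose si rightmost bits lie in T[j]
    and whose (k - si) leftmost bits lie in T[j - 1]."""
    low_mask = (1 << k) - 1                     # keep only k bits after the left shift
    return ((T[j - 1] << si) | (T[j] >> (k - si))) & low_mask
```

For the bytes 01100100 10001001, taking j = 1 and si = 3 yields 00100100, that is, the last five bits of the first byte followed by the first three bits of the second.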

The Binary-Hash-Matching algorithm has an O(⌈m/k⌉ n) time complexity
and requires O(m + 2^k) extra space.



For blocks of length k the size of the Hs table is 2^k, which seems reasonable
for k = 8 or even 16. For greater values of k it is possible to adapt the
algorithm to choose the desired time/space tradeoff by introducing a new parameter
K ≤ k, representing the number of bits taken into account for computing the
shift advancement. Roughly speaking, only the K rightmost bits of the current
window of the text are taken into account, reducing the total size of the tables
to 2^K at the cost of sometimes shifting the pattern less than could be done
if the full length of a block had been considered.

4.2 The Binary-Skip-Search Algorithm 

The Skip-Search algorithm was presented in [CLP98] by Charras, Lecroq
and Pehoushek. The idea of the algorithm is straightforward. Let p be a pattern
of length m and t a text of length n, both over a finite alphabet Σ. For each
character c of the alphabet, a bucket collects all the positions of that character
in the pattern. When a character occurs ℓ times in the pattern, there are ℓ
corresponding positions in the bucket of that character. Formally, for c ∈ Σ the
Skip-Search algorithm computes the table S[c] where

$$S[c] = \{\, i \mid 0 \le i < m \;\wedge\; p[i] = c \,\}.$$

Notice that when the pattern is much shorter than the alphabet,
many buckets are empty. The main loop of the search phase consists in
examining every m-th text character, t[j] (so there will be n/m main iterations).
For each character t[j] it uses each position in the bucket S[t[j]] to obtain all
possible starting positions of p in t. For each position the algorithm performs
a comparison of p with t, character by character, until there is a mismatch, or
until an occurrence is found.

In the binary case, for each possible block B of k bits, a bucket collects all pairs (i, h) in the
table Patt such that Patt[i, h] = B. When a block of bits occurs several times in
the pattern, there are different corresponding pairs in the bucket of that block.
Observe that for a pattern of length m there are m - k + 1 different blocks
of length k corresponding to the blocks Patt[i, h] such that kh - i ≥ 0 and
k(h + 1) - i - 1 < m.

However, to take advantage of the block structure of the text, we can compute
buckets only for blocks contained in the suffix of the pattern of length m' =
k⌊m/k⌋. In such a way m' is a multiple of k and it suffices to examine one
block out of every m'/k blocks of the text.

Formally, for 0 ≤ B < 2^k

$$S_k[B] = \{(i,h) : (m \bmod k) \le kh - i \le m - k \;\wedge\; \mathit{Patt}[i,h] = B\}.$$

For example in the case of the pattern P = 110010110010110010110 we
have S_k[01011001] = {(7, 2)}, S_k[01100101] = {(3, 1), (5, 2)}, S_k[11001011] =
{(2, 1), (4, 2)}, S_k[10010110] = {(1, 1), (3, 2)}, S_k[10110010] = {(6, 2)} and
S_k[00101100] = {(0, 1)}.
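
The buckets can be computed directly from the definition of S_k, as in the sketch below. It is not the preprocessing code of Figure 4: the function name build_buckets is hypothetical, the pattern is given as a '0'/'1' string, and a dictionary stands in for the table of size 2^k. Applying the definition to P stores the block 11001011 at the pairs (2, 1) and (4, 2) and the block 10110010 at the pair (6, 2).

```python
def build_buckets(p: str, k: int = 8) -> dict:
    """Map each k-bit block value B to the pairs (i, h) with Patt[i, h] = B,
    restricted to blocks lying in the suffix of p of length k * (m // k)."""
    m = len(p)
    buckets = {}
    for start in range(m % k, m - k + 1):       # start = k*h - i, the block's position in p
        h = -(-start // k)                      # ceil(start / k)
        i = k * h - start                       # row of Patt, 0 <= i < k
        block = int(p[start:start + k], 2)
        buckets.setdefault(block, []).append((i, h))
    return buckets


buckets = build_buckets("110010110010110010110")
```

For this pattern the nine admissible start positions 5, ..., 13 are spread over six distinct block values, so most of the 256 possible blocks have an empty bucket.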
Fig. 4. The Binary-Skip-Search algorithm for the binary string matching problem.



The Binary-Skip-Search algorithm is shown in Figure 4. Its preprocessing
phase consists in computing the buckets for all possible blocks of k bits. The
space and time complexity of this preprocessing phase is O(m + 2^k). The main
loop of the search phase consists in examining every (m'/k)-th text block. For
each block T[j] examined in the main loop, the algorithm inspects each pair
(i, pos) in the bucket S_k[T[j]] to obtain a possible alignment of the pattern
against the text (line 6). For each pair (i, pos) the algorithm checks whether p
occurs in t by comparing Patt[i, h] and T[j - pos + h], for h = 0, ..., Last[i] (lines
7-10). The Binary-Skip-Search algorithm has an O(⌈m/k⌉ n) quadratic worst
case time complexity and requires O(m + 2^k) extra space.

In practice, if the block size is k, the Binary-Skip-Search algorithm
requires a table of size 2^k to compute the function S_k. This is just 256 for k = 8,
but for k = 16 or even 32, such a table might be too large. In particular for
growing values of k, there will be many cache misses, with strong impact on
the performance of the algorithm. Thus for values of k greater than 8 it may be
suitable to compute the function on the fly, still using a table for single bytes.
Suppose for example that k = 32 and suppose B is a block of k bits. Let B_j be
the j-th byte of B, with j = 1, ..., 4. The set of all possible pairs associated to
the block B can be computed as

$$S_{32}[B] = S_8^0[B_1] \cap S_8^1[B_2] \cap S_8^2[B_3] \cap S_8^3[B_4]$$

where we have set S_8^q[B_j] = {(i, h) | (i, h + q) ∈ S_8[B_j]}.

If we suppose that the distribution of zeros and ones in the input string
is like in a randomly generated one, then the probability of occurrence in the
text of any binary string of length k is 2^-k. This is a reasonable assumption
for compressed text [KBD89]. Then the expected cardinality of the set S_k[B], for a
pattern p of length m, is (m - 7) × 2^-8, that is less than 2 if m < 500. Thus in
practical cases the set S_k[B] can be computed in constant expected time.
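
The bound can be checked with a short calculation, sketched here for k = 8 under the uniformity assumption stated above. The indicator variables X_j are notation introduced only for this sketch, and the count is taken over all m - 7 factors of length 8 of the pattern.

```latex
% X_j = 1 if the factor p[j .. j+7] equals B, for j = 0, ..., m-8 (m-7 factors in all)
\mathbb{E}\Bigl[\sum_{j=0}^{m-8} X_j\Bigr]
  = \sum_{j=0}^{m-8} \Pr[X_j = 1]
  = (m-7)\cdot 2^{-8},
\qquad
m = 500:\quad \frac{493}{256} \approx 1.93 < 2 .
```

Since the expression grows with m, the same bound holds for every pattern length used in the experiments of Section 5.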



5 Experimental Results 



Here we present experimental data which allow us to compare, in terms of running
time and number of text character inspections, the following string matching
algorithms under various conditions: the Binary-Naive algorithm (BNAIVE)
of Figure 2, the Binary-Boyer-Moore algorithm by Klein (BBM) presented
in [KBN07], the Binary-Hash-Matching algorithm (BHM) of Figure 3, and
the Binary-Skip-Search algorithm (BSKS) of Figure 4.

For the sake of completeness, for experimental results on running times we
have also tested the following algorithms for standard pattern matching: the
q-Hash algorithm [Lec07] with q = 8 (HASH8) and the Extended-BOM
algorithm [FL08] (EBOM). These are among the most efficient in practical cases.
The q-Hash and Extended-BOM algorithms have been tested on the same
texts and patterns but in their standard form, i.e. each character is an 8-bit
ASCII value, thus obtaining a comparison between methods on standard and
binary strings.

To simulate the different conditions which can arise when processing binary
data we have performed our tests on texts with different distributions of zeros
and ones. For the case of compressed strings it is quite reasonable to assume
a uniform distribution of characters. For compression schemes using Huffman
coding, such randomness has been shown to hold in [KBD89]. In contrast, when
processing binary images we expect a non-uniform distribution of characters. For
instance in a fax image usually more than 90% of the total number of bits is set
to zero.

All algorithms have been implemented in the C programming language and
were used to search for the same binary strings in large fixed text buffers on a
PC with an Intel Core2 processor of 1.66GHz. In particular, the algorithms have
been tested on three Rand(0/1)_γ problems, for γ = 50, 70 and 90. Searches have
been performed for binary patterns, of length m from 20 to 500, which have been
taken as substrings of the text at random starting positions.

In particular each Rand(0/1)_γ problem consists of searching a set of 1000
random patterns of a given length in a random binary text of 4 x 10^ bits. The
distribution of characters depends on the value of the parameter γ. In particular
bit 0 appears with a percentage equal to γ%.

Moreover, for each test, the average number of character inspections has been 
computed by taking the total number of times a text byte is accessed (either to 
perform a comparison with the pattern, or to perform a shift) and dividing it 
by the length of the text buffer. 

In the following tables, running times (on the left) are expressed in hun- 
dredths of seconds. Tables with the number of text character inspections (on the 
right) are presented in light-gray background color. Best results are bold faced. 

[Tables: experimental results for the Rand(0/1)_γ problems, the last one for γ = 90.]

Experimental results show that the Binary-Skip-Search and the
Binary-Hash-Matching algorithms obtain the best run-time performance in all cases.
In particular it turns out that the Binary-Skip-Search algorithm is the best
choice when the distribution of characters is uniform. In this case the
algorithm is 10 times faster than Binary-Boyer-Moore, and 100 times faster than
Binary-Naive. Moreover, it performs fewer than 50% of the inspections performed by
the Binary-Boyer-Moore algorithm, especially for long patterns.

For non-uniform distributions of characters the Binary-Hash-Matching
algorithm obtains the best results in terms of both running time and
number of character inspections. It turns out to be at least two times faster than
the Binary-Boyer-Moore algorithm and to perform a number of text character
inspections which is less than 50% of that performed by the
Binary-Boyer-Moore algorithm.



6 Conclusion 



Efficient variants of the q-Hash and Skip-Search pattern matching algorithms
have been presented for the case in which both text and pattern are over a binary
alphabet. The algorithms exploit the block structure of the binary strings and
process text and pattern without any bit manipulation. Both algorithms
have an O(nm) worst case time complexity. However, from our experimental results it turns
out that the presented algorithms are the most effective in practical cases.